Allow custom check messages #1092
Conversation
Co-authored-by: Isaac
✅ 738/738 passed, 41 skipped, 4h43m32s total. Running from acceptance #4562
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1092      +/-   ##
==========================================
- Coverage   92.06%   91.13%   -0.94%
==========================================
  Files         101      101
  Lines        9546     9557      +11
==========================================
- Hits         8789     8710      -79
- Misses        757      847      +90
```
…eholders
The custom-message field is renamed to message_expr and now accepts either a
Spark SQL expression string or a Spark Column object. The expression is
evaluated as-is — DQX no longer substitutes {rule_name}, {check_func_name}, or
{column_value} placeholders. Callers reference columns and identifiers directly
inside the expression.
- DQRule / DQForEachColRule: rename message -> message_expr (str | Column | None)
- to_dict serialises message_expr only when it is a string (Column is in-process)
- DQRuleManager._build_message_col simplified: F.expr(str) or use Column directly
- checks_serializer: read message_expr key from metadata, pass to constructors
- Unit tests rewritten for the new attribute name and Column variant
- Integration tests rewritten without placeholders; the existing
test_apply_checks_without_custom_message_unchanged is dropped (the Engine's
no-message default path is already covered by surrounding tests)
- Docs (quality_checks.mdx) rewritten to drop the placeholder section, add a
Column-based Python example, and annotate each example with the rendered
message string
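A minimal usage sketch of the renamed field, assuming the `DQRowRule` constructor described above (import paths follow DQX's documented layout and are illustrative):

```python
from pyspark.sql import functions as F
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.check_funcs import is_not_null

# String form: parsed with F.expr, so literal text needs SQL quotes.
rule_sql = DQRowRule(
    check_func=is_not_null,
    column="email",
    message_expr="concat('email missing for id=', cast(id as string))",
)

# Column form: built in-process; to_dict() will not serialise it.
rule_col = DQRowRule(
    check_func=is_not_null,
    column="email",
    message_expr=F.lit("Email must not be null"),
)
```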
Co-authored-by: Isaac
mwojtyczka left a comment
Small, well-isolated change with good happy-path test coverage. Posting 11 inline comments below covering: a security exposure (arbitrary F.expr evaluation of strings sourced from YAML/metadata), a UX trap with unquoted strings, several maintainability/polish items, and one test bug.
One non-inline issue: PR description doesn't match the implementation.
The description says:
> Add an optional `message` callable parameter ... that allows users to define custom check failure messages. The callable receives rule context (`rule_name`, `check_func_name`, `check_func_args`, `column_value`) and returns a Spark Column expression…

But the actual API is `message_expr: str | Column | None`: no callable, no rule context. Either the description should be rewritten to match what shipped, or the implementation should provide the callable form the description promises. Reviewers (and future readers running git log / GitHub PR archeology) rely on the description for context.
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
```
Arbitrary Spark SQL evaluation of strings from non-trusted sources.
F.expr(self.check.message_expr) runs raw SQL with no validation. When checks come from YAML files, table-stored checks, or any persisted metadata source, this becomes an arbitrary-expression evaluation surface:
```yaml
- check: { function: is_not_null, arguments: { column: id } }
  message_expr: "concat('hi ', (select secret_key from secrets.config limit 1))"
```

Not a classic SQL injection (Spark `expr` doesn't allow DDL/DML), but it does allow:
- Subqueries against any catalog the running session can read.
- Calls to potentially expensive UDFs (`reflect`, `java_method`, custom Spark UDFs).
- Information disclosure via embedded subqueries that surface secrets in the violation message.
The project's CLAUDE.md security policy states:

> User-provided or templated SQL must be validated with `is_sql_query_safe()` from `utils.py` before execution. Raise `UnsafeSqlQueryError` when the query is unsafe.
Options:

- Validate `message_expr` strings via `is_sql_query_safe()` (or a stricter subset that disallows subqueries) before passing to `F.expr` (see the sketch below).
- Restrict the string form to literal/`concat`/`coalesce` patterns only.
- At minimum, document `message_expr` strings as trusted input so metadata sources validate before storage.
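A minimal sketch of the first option, assuming `is_sql_query_safe()` and `UnsafeSqlQueryError` exist as the quoted policy describes (both import paths are guesses, not verified against the codebase):

```python
from pyspark.sql import Column, functions as F

from databricks.labs.dqx.utils import is_sql_query_safe  # assumed location
from databricks.labs.dqx.errors import UnsafeSqlQueryError  # assumed location

def build_message_col(message_expr: str | Column) -> Column:
    """Validate string expressions before they reach F.expr."""
    if isinstance(message_expr, str):
        if not is_sql_query_safe(message_expr):
            raise UnsafeSqlQueryError(f"Unsafe message_expr: {message_expr!r}")
        return F.expr(message_expr)
    return message_expr
```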
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
```
UX trap: a plain string passes through F.expr and fails confusingly.
A new user will write:
```python
DQRowRule(..., message_expr="Email must not be null")
```

`F.expr("Email must not be null")` parses this as `Email` (column ref) `must` (column ref) `not be null` and fails with a confusing column-not-found error far from the rule definition.
The docs example correctly shows "'Email must not be null'" (Python double-quote wrapping a SQL single-quoted literal), but that's an awkward shape and easy to forget.
Options:

- Auto-detect plain strings (no parens, no SQL operators) and wrap with `F.lit` automatically (see the sketch below).
- Best-effort validate at construction time (`__post_init__` tries `F.expr` once and surfaces a clear error mentioning the SQL-quote requirement).
- At minimum, expand the docs Admonition with a "common mistake" callout showing the unquoted form failing.
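An illustrative sketch of the auto-detect option; the character-set heuristic is invented here, and a real implementation would need a tighter definition of "plain string":

```python
from pyspark.sql import Column, functions as F

# Crude heuristic: any of these characters suggests a SQL expression.
_SQL_HINT_CHARS = set("()'\"`")

def to_message_col(message_expr: str | Column) -> Column:
    if isinstance(message_expr, Column):
        return message_expr
    if not _SQL_HINT_CHARS.intersection(message_expr):
        # Plain prose like "Email must not be null": treat as a literal.
        return F.lit(message_expr)
    return F.expr(message_expr)  # otherwise parse as a Spark SQL expression
```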
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
return F.when(condition.isNotNull(), custom_message).otherwise(F.lit(None).cast("string"))
```
No length cap on rendered message.
A pathological expression like concat(repeat('x', 100000)) produces a 100k-char message embedded in every result row's struct — bloating Delta files and breaking downstream consumers that assume short messages. Other parts of DQX cap at 500 chars (_LLM_FIELD_MAX_LEN in the anomaly explainer).
Consider wrapping with substring(message, 1, MAX_MESSAGE_LEN) here, or applying the cap as a Spark expression on custom_message before the F.when.
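A sketch of the cap; `MAX_MESSAGE_LEN` is an assumed constant (500 here, mirroring `_LLM_FIELD_MAX_LEN`), not an existing DQX name:

```python
from pyspark.sql import Column, functions as F

MAX_MESSAGE_LEN = 500  # assumed value, mirroring _LLM_FIELD_MAX_LEN

def capped_message(condition: Column, custom_message: Column) -> Column:
    # Truncate before embedding the message in the result struct so a
    # pathological expression cannot bloat every violation row.
    capped = F.substring(custom_message, 1, MAX_MESSAGE_LEN)
    return F.when(condition.isNotNull(), capped).otherwise(F.lit(None).cast("string"))
```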
```python
user_metadata: dict[str, str] | None = None
message_expr: str | Column | None = None

def __post_init__(self):
```
Add best-effort message_expr validation in __post_init__.
A typo'd SQL string ("concat(invalid") currently fails only when the rule is applied to a DataFrame, far from where it was defined. A best-effort validation here surfaces the error at the call site:
```python
if isinstance(self.message_expr, str) and self.message_expr:
    try:
        F.expr(self.message_expr)
    except Exception as exc:
        raise InvalidParameterError(
            f"message_expr is not a valid Spark SQL expression: {exc}. "
            f"Note: literal strings must be wrapped in SQL quotes, e.g. \"'my message'\"."
        ) from exc
```

Cheap (build-only, no DataFrame eval), and the error message hints at the unquoted-string trap in the previous comment.
```python
check_func_args: list[Any] = field(default_factory=list)
check_func_kwargs: dict[str, Any] = field(default_factory=dict)
user_metadata: dict[str, str] | None = None
message_expr: str | Column | None = None
```
Overloaded parameter type — consider splitting before this becomes public API.
`message_expr: str | Column` forces callers (and serialisers, and future API extensions) to runtime-dispatch on type. Cleaner alternatives:

- `message_sql: str | None` and `message_col: Column | None` with mutual-exclusion validation.
- A single `message: Column | None` plus a `message_from_sql(s: str) -> Column` helper that wraps `F.expr` with the validation/sanitisation from the prior comments (sketch below).
Not a blocker — your current shape works — but worth deciding now since changing the field name later is a breaking change.
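A sketch of the second alternative; `message_from_sql` is a hypothetical helper, not existing DQX API:

```python
from pyspark.sql import Column, functions as F

def message_from_sql(s: str) -> Column:
    # Single choke-point for the validation/sanitisation discussed in the
    # earlier comments (e.g. is_sql_query_safe) before the string hits F.expr.
    return F.expr(s)

# Callers then always pass a Column:
# rule = DQRowRule(..., message=message_from_sql("concat('id=', cast(id as string))"))
```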
```
* *check_func_args* (optional) - Positional arguments for the check function (excluding *column*).
* *check_func_kwargs* (optional) - Keyword arguments for the check function (excluding *column*).
* *user_metadata* (optional) - User-defined key-value pairs added to metadata generated by the check.
* *message_expr* (optional) - User-defined expression used as the check failure message. Accepts either
```
No placeholder support for rule_name / check_func_name / column_value — document the limitation.
The PR description lists these as expected context, but the implementation evaluates the expression as-is. Users who want a message like "<rule_name> failed for value <column_value>" have to hand-type the rule name twice and inline the column reference (and the column reference doesn't generalise across DQForEachColRule instances).
This will be a frequent feature request as a follow-up. Worth either:

- Documenting the limitation explicitly in the docstring ("placeholder substitution is not supported; reference column names directly via Spark SQL"); today's hand-typed workaround is sketched below.
- Considering whether to add `{rule_name}` / `{column_value}` placeholder substitution in this PR or as a tracked follow-up. If you choose to defer, link to a follow-up issue here so users searching the codebase find it.
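What the hand-typed workaround looks like today, as an illustrative sketch (the rule name, column, and import paths are invented for the example):

```python
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.check_funcs import is_not_null

# The rule name and column reference are both repeated by hand in the expression.
rule = DQRowRule(
    name="id_is_not_null",
    check_func=is_not_null,
    column="id",
    message_expr="concat('id_is_not_null failed for value ', coalesce(cast(id as string), 'NULL'))",
)
```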
```python
# Only string expressions can be round-tripped through metadata; Column objects are
# in-process Spark expressions with no canonical YAML/JSON representation.
if isinstance(self.message_expr, str) and self.message_expr:
    metadata["message_expr"] = self.message_expr
```
to_dict() silently drops Column-typed message_expr — users won't notice.
A user who passes message_expr=F.concat(...) and calls to_dict() to persist their rules to YAML/Delta loses the message with no signal. They'll discover it only when re-loading and seeing the default message instead.
Options:

- Log a WARNING when a Column-typed `message_expr` is encountered during serialisation: "Column message_expr cannot be serialised; falling back to default message on round-trip." (See the sketch below.)
- Or raise `InvalidParameterError("Column-typed message_expr cannot be serialised; pass a SQL expression string if you need round-trip persistence")`.
Logging is friendlier; raising forces the user to make an explicit choice.
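A sketch of the logging option, mirroring the `to_dict()` branch shown above (the standalone function shape is an assumption, not the shipped code):

```python
import logging

from pyspark.sql import Column

logger = logging.getLogger(__name__)

def serialise_message_expr(message_expr, metadata: dict) -> None:
    if isinstance(message_expr, str) and message_expr:
        metadata["message_expr"] = message_expr
    elif isinstance(message_expr, Column):
        logger.warning(
            "Column message_expr cannot be serialised; "
            "falling back to default message on round-trip."
        )
```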
```python
check_func_args: list[Any] = field(default_factory=list)
check_func_kwargs: dict[str, Any] = field(default_factory=dict)
user_metadata: dict[str, str] | None = None
message_expr: str | Column | None = None
```
DQForEachColRule.message_expr is shared across all generated rules — flag in the field comment.
The integration test correctly notes this ("The same expression is used for every generated rule. To produce a per-column message, reference each column inline"), but the field itself has no docstring or warning. A user expecting per-column message substitution will be surprised when all generated rules emit identical messages — and won't know the workaround.
Add a one-line comment on the field describing the shared-expression behaviour, ideally with the inline-reference workaround pointer:
```python
# Shared across every generated rule; reference column names inline (e.g. "concat('a=', cast(a as string))")
# if you need per-column messages, or construct rules individually instead of via DQForEachColRule.
message_expr: str | Column | None = None
```
```python
def test_dq_dataset_rule_message_expr_defaults_to_none():
    """DQDatasetRule without message_expr should default to None."""
    rule = DQRowRule(check_func=is_not_null_and_not_empty, column="id")
```
Test bug: this test claims DQDatasetRule but constructs DQRowRule.
```python
def test_dq_dataset_rule_message_expr_defaults_to_none():
    """DQDatasetRule without message_expr should default to None."""
    rule = DQRowRule(check_func=is_not_null_and_not_empty, column="id")
    #      ^^^^^^^^^ should be DQDatasetRule with a dataset-level check function
    assert rule.message_expr is None
```

The test name and docstring claim `DQDatasetRule`, but the body uses `DQRowRule`. Either rename the test (it's now a duplicate of `test_dq_row_rule_message_expr_defaults_to_none` above), or fix the body to actually construct a `DQDatasetRule` with a `@register_rule("dataset")` check function, at which point the test exercises something new; a sketch follows.
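A sketch of the second option; `is_unique` is assumed here to be a dataset-level (`@register_rule("dataset")`) check, so substitute whichever dataset check the suite already imports:

```python
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx.check_funcs import is_unique  # assumed dataset-level check

def test_dq_dataset_rule_message_expr_defaults_to_none():
    """DQDatasetRule without message_expr should default to None."""
    rule = DQDatasetRule(check_func=is_unique, columns=["id"])
    assert rule.message_expr is None
```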
```python
    message_expr=F.lit("Custom error"),
)
rule_dict = rule.to_dict()
assert "message_expr" not in rule_dict
```
Add error-path and round-trip tests.
Current unit tests cover construction + to_dict() happy paths. Two missing categories:
1. Error paths: what happens when `message_expr` is malformed?

```python
def test_apply_checks_invalid_message_expr_fails_clearly(spark):
    rule = DQRowRule(
        check_func=is_not_null,
        column="id",
        message_expr="concat(broken syntax",
    )
    df = spark.createDataFrame([(None,)], "id: int")
    with pytest.raises(Exception):
        DQEngine(...).apply_checks(df, [rule])
```

Locks in the failure mode and gives users a debugging anchor.

2. Explicit round-trip (implicit only today).

```python
def test_string_message_expr_round_trips_through_metadata():
    rule = DQRowRule(..., message_expr="'static msg'")
    serialized = rule.to_dict()
    # ... deserialize, apply, assert message preserved.

def test_column_message_expr_lost_on_round_trip():
    rule = DQRowRule(..., message_expr=F.lit("col msg"))
    serialized = rule.to_dict()
    assert "message_expr" not in serialized  # documented behaviour
```
```python
custom_message = (
    self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr)
)
```
Check whether it is a string (rather than checking for Column), to avoid problems with the classic Spark Column class vs the Spark Connect Column class.
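A sketch of the suggested inversion (the standalone function is illustrative; in the PR the logic sits inline in `_build_message_col`):

```python
from pyspark.sql import functions as F

def resolve_message(message_expr):
    # Dispatch on str: any non-string value (a classic Column or a Spark
    # Connect Column) passes through untouched, so no isinstance check
    # against a specific Column class is needed.
    return F.expr(message_expr) if isinstance(message_expr, str) else message_expr
```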
Changes

Add an optional `message` callable parameter to `DQRule`, `DQRowRule`, `DQDatasetRule`, and `DQForEachColRule` that allows users to define custom check failure messages. The callable receives rule context (`rule_name`, `check_func_name`, `check_func_args`, `column_value`) and returns a Spark Column expression, enabling dynamic messages that can include column values.

When `message` is `None` (the default), the existing auto-generated message behavior is preserved. When provided, the custom message replaces the default message for failed rows while maintaining null/non-null semantics for passing rows.

Linked issues
Resolves #958
Tests